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EFFICIENT COMPUTATION OF THE JOINT SAMPLE 
FREQUENCY SPECTRA FOR MULTIPLE POPULATIONS 


By John A. Kamm* *, Jonathan TerhorsU and 
Yun S. Song*’^ 

University of California, Berkeley 

A wide range of studies in population genetics have employed 
the sample frequency spectrum (SFS), a summary statistic which de¬ 
scribes the distribution of mutant alleles at a polymorphic site in 
a sample of DNA sequences. In particular, recently there has been 
growing interest in analyzing the joint SFS data from multiple pop¬ 
ulations to infer parameters of complex demographic histories, in¬ 
cluding variable population sizes, population split times, migration 
rates, admixture proportions, and so on. Although much methodolog¬ 
ical progress has been made, existing SFS-based inference methods 
suffer from numerical instability and high computational complexity 
when multiple populations are involved and the sample size is large. 
In this paper, we present new analytic formulas and algorithms that 
enable efficient computation of the expected joint SFS for multiple 
populations related by a complex demographic model with arbitrary 
population size histories (including piecewise exponential growth). 
Our results are implemented in a new software package called momi 
( MO ran Models for Inference). Through an empirical study involving 
tens of populations, we demonstrate our improvements to numerical 
stability and computational complexity. 


1. Introduction. The sample frequency spectrum (SFS) is the distri¬ 
bution of allele frequencies at a polymorphic site in a collection of DNA 
sequences randomly drawn from a population. This summary statistic is 
used in a variety of inference problems in population genetics [5, 11, 13, 16, 
17, 18, 19, 22, 33, 34, 42], often in the context of likelihood-based analysis 
of single nucleotide polymorphism (SNP) data. Over the past several years, 
there has been much interest in analyzing the joint SFS data from multiple 
populations to infer complex demographic models involving population size 
changes, population splits, migration, and admixture. Inferring population 
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demographic histories is not only intrinsically interesting, for example in dat¬ 
ing events such as the out-of-Africa migration of modern humans [19, 38], 
but is also important for biological applications, such as distinguishing be¬ 
tween the effects of natural selection and demography [3, 6]. 

Likelihood-based inference methods using the SFS require accurate com¬ 
putation of the expected SFS under a given demographic model. As further 
detailed below, however, existing methods suffer from numerical instability 
and high computational complexity when multiple populations are involved 
and the sample size is large. The joint SFS for multiple populations describes 
the distribution of joint allele frequencies across the different populations. 
In this paper, we tackle the problem of computing the expected joint SFS 
for many populations, given a complex demographic model relating them. 

The SFS has been studied in the context of two dual processes, the Wright- 
Fisher diffusion [25] and Kingman’s coalescent [15], and both approaches can 
be used to compute the multi-population SFS. In the diffusion approach of 
Gutenkunst et al. [19], which was later further extended [17, 31], one numer¬ 
ically solves partial differential equations forward in time to approximate the 
distribution of joint allele frequencies at present. The diffusion framework 
has the advantage of being applicable to arbitrary demographic models, but 
its computational complexity grows exponentially with the number of pop¬ 
ulations, and current implementations have difficulty handling more than 
three [19] or four populations [31]. 

In the coalescent approach, the SFS is computed by integrating over all 
genealogies underlying the sample. This can be done via Monte Carlo or 
analytically. Monte Carlo integration approach [34] can effectively handle 
arbitrary demographic histories with a large number of populations, and 
Excoffier et al. [13] have recently developed a useful implementation. How¬ 
ever, when the number T) of populations (or demes) is moderate to large, 
most of the 0{'nP) SFS entries, where n denotes the sample size, will be 
unobserved in simulations, and thus the Monte Carlo integral may naively 
assign a probability of 0 to observed SNPs. Monte Carlo computation of 
the SFS thus requires careful regularization techniques to avoid degeneracy 
issues. 

An alternative to the Monte Carlo approach is to compute the SFS ex¬ 
actly via analytic integration over coalescent genealogies [18, 42]. For a de¬ 
mography involving multiple populations, this can be done efficiently by a 
dynamic program [8, 9]. This algorithm is more complicated and less general 
than both the Monte Carlo and diffusion approaches: while it can handle 
population splits, merges, size changes, and instantaneous gene flow, it is 
difficult to include continuous gene flow between populations. However, it 
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scales well to a large number T) of populations, since it only computes entries 
of the SFS that are observed in the data, and ignores the 0{nP) SFS entries 
that are not observed. Unfortunately, existing coalescent-based algorithms 
[8, 9, 42] do not scale well to a large sample size n, both in terms of run¬ 
ning time and numerical stability. In particular, the algorithm relies on large 
alternating sums that explode with n and exhibit catastrophic cancellation. 

In this paper, we significantly improve the computational complexity and 
numerical stability of the coalescent approach. We show how the alternating 
sums can be avoided altogether, and replaced with faster and more stable 
formulas. Moreover, we introduce a second speedup by replacing the coales¬ 
cent with a Moran model in the dynamic program. 

The dynamic program algorithm to compute the SFS involves splitting 
the demography into its component subpopulations, each of which contains 
a single population coalescent, but truncated at some time r in the past. 
In Section 2, we focus on this truncated coalescent. In particular, we focus 
on computing the truncated SFS fn{k), the expected number of mutations 
arising in the time interval [0, r) which subtend exactly k out of n individuals 
sampled at time 0. We give an algorithm for computing fn{k) efficiently, 
using recurrence relations combined with results from Polanski, Bobrowski 
and Kimmel [36], Polanski and Kimmel [37] and Bhaskar, Wang and Song 
[5]. We also provide an alternative formula for fn{k) based on the coalescent 
with killing. 

In Section 3, we describe the coalescent algorithm of Chen [8, 9], and 
show how our formulas for {k) improve its computational complexity. For 
the special case where the demographic history forms a tree, we introduce 
an additional speedup by replacing the coalescent with a Moran model. For 
such tree-shaped demographies, we can compute the observed SFS entries 
in 0{n^'D + n\og{n)'DL), where n is the sample size, F is the number of 
populations at the present, and L is the number of observed entries in the 
SFS. This is an improvement over the 0{n^F+n'^FL) complexity in Chen [8, 
9]. For more general demographic histories with migration or admixture, the 
algorithm of Chen [8, 9] is 0{n^V+WL), where V is the number populations 
(vertices) throughout the history, and VF is a term that depends on n and 
the graph structure of the demography; we improve this to 0(n^V + WL). 
In future work, we will give explicit expressions for VF, and extend our 
Moran-based speedup to demographies with pulse migration. 

We note that some of our improvements are related to results in Bryant 
et al. [7], whose 0{n^ \og{n)FL) algorithm computes the one-locus likelihood 
for species trees with recurrent mutation and piecewise constant population 
sizes. By contrast, our method, like that of Chen [8, 9], considers an infinite 
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sites model [26] without recurrent mutation, but can handle exponentially 
growing population sizes. In fact, our method goes even further, and easily 
accommodates arbitrary population size changes. 

In Section 4, we demonstrate the improved speed and accuracy of our 
algorithm in a numerical study involving tens of populations. We implement 
and release our algorithm in a publicly available Python package, momi 
( MO ran Models for Inference). Proofs of the mathematical results presented 
in Section 2 are provided in Section 5, 

2. The truncated sample frequency spectrum. 

2.1. Background. We denote Kingman’s coalescent [27, 28, 29] on n 
leaves {C^}t>o to be the backward-in-time Markov jump process, whose 
value at time t is a partition of {!,..., n}, and at time t, each pairs of 
blocks in Cf coalesce with rate a{t). We also call the population size 
history function. We denote the ancestral process = \Cf'\ to be the num¬ 
ber of blocks in Cf, so that is a pure death process with Aq" = n and 
the rate of transition from m to m — 1 given by ~ 

We often drop the dependence on n, and write Ct = Cf and A^i = A^^. 
We prefer to denote a dependence on n through the probability and the 
expectation E„. So if X{C^) denotes a random variable of the process C”, 
we usually write E„[X] instead of E[X(C”)]. 

Let denote the partition of {1,..., n} when Ct has i blocks (also referred 
to as lineages). Let T* = denote the amount of time Ct has exactly 

i lineages. It is a fundamental fact of the coalescent that the waiting times 
Tn :2 = {Tn,... ,T 2 ) ave independent of the partitions ^n :2 = (^n, • • •, ^ 2 ) [27]. 

A sample path of C” can be viewed as a rooted ultrametric tree with n 
leaves labeled 1 ,..., n, where Ct is the partition induced on {!,..., n} by 
cutting the tree at height t. Now suppose we drop mutations onto this tree as 
a Poisson point process with rate |, and let Ai denote the set of leaves that 
are beneath mutations (where we only consider mutations beneath the root, 
so by assumption Ai 7 ^ {1, • • • ,n}). Then we define the sample frequency 
spectrum fnik), for 0 < A: < n, as the first order Taylor series coefficient of 
Pn(|A4| = k) in the mutation rate, 

rni\Ai\ = k) = p^{k) + o{e). 

We will generally refer to fnik) as the sample frequency spectrum (SFS). 

We also note two alternative definitions of the SFS. First, fnik) is the 
expected number of mutations with k descendants when ^ = 1. Second, 
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= {1.4,5} 

a = {{1,2}, {3}, {4, 5}}, A? = 3 

niKTs) 

n (= n) 


Fig 1: A sample path of the coalescent truncated at time r. Star symbols 
denote mutations, while denotes the set of leaves under those mutations, 
denotes the waiting time in the interval [0, r) while there are k lineages. 



j4r\fn{\K\) is the expected length of the branch whose leaf set is K. More 
V|x| j 

specifically, let I denote the indicator function, and define Ck ■= /g°° ^KeCtdt- 
Then 

= ^n[CK]. 

\\K\) 

The equivalence of these alternate definitions follows from previous results 
in Bhaskar, Kamm and Song [4], Griffiths and Tavare [18], Jenkins and Song 
[23]. 

Note the SFS is sometimes defined to be a normalized version of fn{k), 
so that the entries sum to 1. We do not follow that convention, and use the 
unnormalized definition for the SFS throughout this paper. 

2.2. The truncated coalescent and SFS. We now consider truncating the 
coalescent with mutation at time r, as illustrated in Figure 1. Let denote 
the set of leaves under mutations occurring in the time interval [0,t). We 
define the truncated SFS fn{k) according to 

Fn{\M^\ = k) = ^-f-{k) + o{e). 

By the same arguments as for the untruncated SFS, one can show that 
fn{k) gives the expected number of mutations in [0,r) with k descendants, 
and letting := J^IxeCtdi denote the branch length subtending K C 
{1 ,..., n} within [0, r), we have 

= E„[£],]. 
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Note that for k < n, we have fn{k) = f^{k). For the truncated SFS, 
we will also consider mutations above the root, and so allow k = n (i.e., 
n}), with fn{n) = ^|] giving the expected number 

of mutations within [0, r) subtending the whole sample. 

Given a random variable X, we define conditional versions of the SFS 
/^(/c I X) according to 

¥n{\M^\=k\x) = ^-fi(k\x) + o{e). 

An example of a useful conditional SFS is | = m), the expected 

branch length subtending k leaves given m ancestors at time r. In particular, 
Chen [8] devised a dynamic program algorithm to compute the joint SFS 
for multiple populations under complex demographic histories, by computing 
on each subpopulation of the history, where r is the length of 
time a particular subpopulation exists. The unconditional SFS fl{k) is in 
turn computed in terms of {k \ = m) by writing 

n—k-\-l 

(1) fl{k) = ^ = m)fl{k I = m). 

m=l 

In Section 3.1, we describe the dynamic program algorithm for computing 
the joint SFS for multiple populations, and the way in which this algorithm 
uses the terms fl{k). 

We consider how to compute (1). The first term in the summand, = 

m), can be computed in at least three ways: by numerically exponentiating 
the rate matrix of A^, by computing an alternating sum with 0{i') terms 
[41], or by solving a recursion we describe in Section 5.1. We note that the 
recursion described in Section 5.1 has the advantage of computing all values 
of Pj/(A^ = m), m < V < n, m 0{'n?) time. 

The second term fy{k \ A^ = m) in the summand of (1) is computed in 
Chen [8] as 

V 

(2) f^[k\ A^ = m) = Y^ | A^ = m], 

i=m 

where 

k > j > 0 and u — k > i — j >0, 

if j = /c = 0 or i — j = — /c = 0, 
else. 


P. 


k,j _ 








(■:D 


1 , 

0 , 
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is the transition probability of the Polya urn model, starting with i — j white 
balls and j black balls, and ending with v — k white balls and k black balls 
[24], and 

Tl := lA^=idt 

is the length of time in [0, r) where there are i ancestral lineages to the 
sample, as illustrated in Figure 1. Chen [8] provides a formula for the con¬ 
ditional expectation 'E,y[T[ | ^4^ = m] for the case of constant population 
size, which he later extends [9] to the case of an exponentially growing pop¬ 
ulation. However, these formulas involve a large alternating sum with 0(z^^) 
terms. Thus, computing \ = m] for every value of as re¬ 

quired to compute {fl{k)}k<u<n with (1) and (2), takes 0{n^) time with 
these formulas. In addition, large alternating sums are numerically unstable 
due to catastrophic cancellation [20], and so these formulas require the use 
of high-precision numerical libraries, further increasing runtime. 

2.3. An efficient, numerically stable algorithm for computing the trun¬ 
cated SFS. Here, we present a numerically stable algorithm to compute, 
for a given positive integer n, all of {ff{k) j 1 < A: < < n} in Offi?) time 

instead of O(ffi) time. Our approach utilizes the following two lemmas: 

Lemma 1. The entry ff{n) of the truncated SFS is given by 

n—1 j 

( 3 ) /,;(«) = T-Y^-rjk). 

k=l 

Lemma 2. For all 1 < k < n, the truncated SFS fl{k) satisfies the 
linear recurrence 

(4) /;W = 

We prove Lemma 1 in Section 5.2. We note here that our proof also yields 
the identity E[rMRCA] = I2k=i ^fn(k), where Tmrca denotes the time to 
the most recent common ancestor of the sample; to our knowledge, this 
formula is new. A proof of Lemma 2 is provided in Section 5.3. 

We now sketch our algorithm. For a given n, we show below that all values 
of ff{k), for 1 < k < n, can be computed in time. We then compute 

ff{n) using Lemma 1 in 0(n) time. Finally, using fn{k) for 1 < A: < n 
as boundary conditions. Lemma 2 allows us to compute all ff{k), for v = 
n — 1, n — 2,..., 2 and A: = 1,..., z/, in 0{ffi) time. 
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We now describe how to compute the aforementioned terms fn{k), for all 
k < n, in 0{'n?‘) time. We first recall the result of Polanski and Kimmel [37] 
which represents the untruncated SFS /,i(A:), for 1 < /c < n — 1, as 

n 

( 5 ) fn{k) = ^ ^ ^^n,k,mCrm 

m=2 


where 



denotes the waiting time to the first coalescence for a sample of size m, 
and Wn,k,m 3-re universal constants that are efficiently computable using the 
following recursions [37]: 


W, 


n,k,2 


6 

n + 1 ’ 


a,fc,3 


^n,/c,m+2 


(n + l)(n + 2)’ 

(1 + m)(3 + 2m)(n — m) 

- -—-—-- Wn 

"^{2m — l)(n + m + 1) 


m 


(3 + 2m) (n - 2A:) 
m(n + m + l) n,k,m+l^ 


(7) 


for 2 < m < n — 2. The key observation is to note that, in a similar vein as 
(5), we have: 


Lemma 3. The truncated SFS fn{k), for 1 < k < n — 1, can be written 
as 

n 

( 8 ) fnik) = ^n,k,mCl„ 

m=2 

where c!^ is a truncated version of (6).' 


c 


r 

m 







a{x)dx 


(9) 


0 


exp 


dt. 
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We prove Lemma 3 in Section 5.4. For piecewise-exponential a{t), can 
be computed explicitly using formulas from Bhaskar, Wang and Song [5]. 
Using (7), we can compute all values of Wn^k,m, for 1 < A: < n and 2 < m < 
n, in 0{n'^) time. Then, using (8), all values of /^(A;), for 1 < A: < n — 1 can 
be computed in 0{n‘^) time. 

Note that the above algorithm not only significantly improves computa¬ 
tional complexity, but also resolves numerical issues, since it allows us to 
avoid computing the expected times \ = m], which are alternat¬ 

ing sums of 0{'n?) terms and are numerically unstable to evaluate for large 
values of n (say, n > 50). 

2.4. An alternative formula for pieeewise-constant subpopulation sizes. 
For demographic scenarios with pieeewise-constant subpopulation sizes, we 
present an alternative formula for computing the truncated SFS within a 
constant piece. This formula has the same sample computational complex¬ 
ity as that described in the previous section. 

Let ICt denote the coalescent with killing, a stochastic process that is 
closely related to the Chinese restaurant process, Hoppe’s urn, and Ewens’ 
sampling formula [2, 21]. In particular, the coalescent with killing {ICt}t>o is 
a stochastic process whose value at time t is a marked partition of {1 ,..., n}, 
where each partition block is marked as “killed” or “unkilled”. We obtain 
the partition for ICt by dropping mutations onto the coalescent tree as a 
Poisson point process with rate ^, and then defining an equivalence relation 
on {1 ,..., re}, where z ~ j if and only if i, j have coalesced by time t and there 
are no mutations on the branches between i and j (i.e., i and j are identical 
by descent). We furthermore mark the equivalence classes (i.e. partition 
blocks) of ICt that are descended from a mutation in [0, t) as “killed”. See 
Figure 2 for an illustration. The process ICr can also be obtained by running 
Hoppe’s urn, or equivalently the Chinese restaurant process, forward in time 
[12, Theorem 1.9]. 

Let A^ be the number of unkilled blocks in ICt, so that A^ is a pure death 
process with transition rate Xft_i{t) = ( 2 )a(A)-|-^ (the rate of coalescence is 
the number of unkilled pairs (^a{t), and the rate of killing due to mutation 
is y). Our next proposition gives a formula for the truncated conditional 
sample frequency spectrum given A^, i.e., ff{k \ A^ = rre). 

Proposition 1. Consider the constant population size history = X 
for t G [0, t), and let m > 0 and 0 < k < n — m. The joint probability that 
the number of derived mutants is k and the number of unkilled ancestral 
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/C. = {{!}, {2}, {3}, {4, 5}} 



1 2 3 4 5 

Fig 2: The coalescent with killing for the genealogy in Figure 1. Note that 
JCr is a marked partition, with the blocks killed by mutations in [0, r) being 
specially marked. 


lineages is m, when truncating at time t, is given by 

lPn{\M'"\ = k,A’^ = m) = ^f^{k \ A’^ = m)¥{A^ = m) + o{9), 

where 

cy /n-m\ 

( 10 ) m\A’^ = m) = 

We prove Proposition 1 in Section 5.5. Note that this equation does not 
hold for the case k = n,m = 0, but fortunately we do not need to consider 
that case in what follows below. 

We can use Proposition 1 to stably and efficiently compute the terms 
fl{k), for k < V < n, as, follows. We first compute the case k < v = n. 
Note that Pn(|AF^| = K) = = K,A^ = m). So for k < n, hy 

Proposition 1 

n 

fnik) = Y.fnik\A^ = m)¥niA‘; = m) 
m=l 

n ry (n—m\ 

( 11 ) = 

m=l \ k ) 

The sum in (11) contains 0(n) terms, so it costs 0{n^) to compute fn{k) 
for all k < n. After this, we use Lemma 1 to compute /^(n), and then use 
Lemma 2 to compute fl{k) for all 1 < /c < i/ < n. Since there are Oiv?) 
such terms, this also takes 0{n^) time. 
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Fig 3: A demographic history with a pulse migration event (left), and its 
corresponding directed graph (right). 


3. The joint SFS for multiple populations. In this section we dis¬ 
cuss an algorithm for computing the multi-population SFS [8, 9, 42]. We 
describe the algorithm in Section 3.1, and note how the results from Sec¬ 
tion 2 improve the time complexity of this algorithm. In Section 3.2, we 
focus on the special case of tree-shaped demographies, and introduce a fur¬ 
ther algorithmic speedup by replacing the coalescent with a Moran model. 

Let V be the number of subpopulations in the demographic history, n 
the total sample size, and L the number of SFS entries to compute. Then 
the results from Section 2 improve the computational complexity of the SFS 
from 0(n^V + WL) to 0(n^F + WL), where IT is a term that depends on 
the structure of the demographic history. In the special case of tree-shaped 
demographies, the algorithm from Chen [8] gives W = 0{n^V). The Moran- 
based speedup from Section 3.2, combined with results from Bryant et al. 
[7], improves this toW = 0{nlog{n)V). 

The Moran-based speedup can be generalized to non-tree demographies, 
but the notation, implementation, and analysis of computational complexity 
becomes substantially more complicated. We thus leave its generalization to 
future work. 

3.1. A coalescent-based dynamic program. Suppose at the present we 
have T) populations, and in the ith population we observe n* alleles. For 
a single point mutation, let x = (xi,... ,xd) denote the number of alleles 
that are derived in each population. We wish to compute /(x), where ^/(x) 
is the expected number of point mutations with derived counts x. 

For demographic histories consisting of population size changes, popula¬ 
tion splits, population mergers, and pulse admixture events, Chen [8] gave an 
algorithm to compute /(x) using the truncated SFS fn{k) that we defined 
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in Section 2. 

We describe this algorithm to compute /(x). We start by representing the 
population history as a directed acyclic graph (DAG), where each vertex v 
represents a subpopulation (Figure 3). We draw a directed edge from v to 
v' if there is gene flow from the bottom-most part of v to the top-most part 
of v', where “down” is the present and “up” is the ancient past. Thus, the 
leaf vertices correspond to the subpopulations at the present. For a vertex 
V in the population history graph, let G (0, oo) denote the length of time 
the corresponding population persists, and let : [ 0 , r^) —>■ M"'' denote 
the inverse population size history of u. So going backwards in time from 
the present, a^,(^) gives the instantaneous rate at which two lineages in v 
coalesce, after v has existed for time t. We use fn{k) to denote the truncated 
SFS for the coalescent embedded in v, i.e., fn{k) = fn'"{k) for a coalescent 
with coalescence rate a^{t). Then we have 

(12) /(x) = ^ 

V m^,k^ 

where mg denotes the number of lineages at the bottom of v that are ances¬ 
tral to the initial sample, and /cg denotes the number of these lineages with 
a derived allele. 

In order to use (12), we must compute /m«(^o) every population v, 
and every value of mg and /cg. If n is the total sample size and V the total 
number of vertices, then this takes O(n^V) time using the formulas of Chen 
[ 8 ]. Our results from Section 2 improve this to O(n^V). 

To use (12), we must also compute the terms P(x | feg,mQ)P(mg), for 
which Chen [ 8 ] constructs a dynamic program, starting at the leaf vertices 
and moving up the graph. This dynamic program essentially consists of 
setting up a Bayesian graphical model with random variables mg,A:Q and 
performing belief propagation, which can be done via the sum-product al¬ 
gorithm (“tree-peeling”) if the population graph is a tree [14, 35], or via a 
junction tree algorithm if not [30]. 

The time complexity of the algorithm thus depends on the topological 
structure of the population graph. In the special case where the demo¬ 
graphic history is a binary tree, the tree-peeling algorithm computes the 
values P(x j /cg,mg)P(mQ) in O(n^V) time, since the vertex v has 0{v?) 
possible states (/sg, mg), so summing over the transitions between every pair 
of states costs O(n^). Note that Chen [ 8 ] mistakenly states that the computa¬ 
tion takes 0{n^V) time. In the further special case that the population sizes 
are piecewise constant, speedups from Bryant et al. [7] can improve this 
to 0{'n?\og{n)V). More specifically, Bryant et al. [7] computes the terms 
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P(x I kQ,mQ)F{mQ) in 0{'n?‘\og{n)V) time for a model with recurrent mu¬ 
tation, but the results can be applied straightforwardly here by setting the 
mutation rate to 0, thus disallowing recurrent mutation. 

To summarize, let W be the time it takes to compute (12) after the 
terms f^ik) have been precomputed, and let L be the number of distinct 
entries x for which we wish to compute /(x). Then our results from Section 2 
improve the computational complexity from 0{n^V+WL) to 0{v?V+WL). 
In the case of a binary tree the original algorithm of Chen [8] gives W = 
0{n^V), but adapting results from Bryant et al. [7] improves the this to 
W = 0(n^ log(n)I7) when the population sizes are piecewise constant. In 
the following section, we introduce a new approach that further improves 
the runtime to IT = 0(nlog(n)T) and generalizes from piecewise constant 
to arbitrary population size histories. 

3.2. A Moran-based dynamic program. We describe a modified version 
of the dynamic programs from Bryant et al. [7], Chen [8] that improves 
the computational complexity of computing /(x) for tree-shaped demogra¬ 
phies. The main idea is to replace the backwards-in-time coalescent with a 
forwards-in-time Moran model. 

We assume the T> populations at the present are related by a binary 
rooted tree with V leaves, where each leaf represents a population at the 
present, and at each internal vertex, a parent population splits into two child 
populations. (Note that a non-binary tree can be represented as a binary 
tree, with additional vertices of height 0). 

Instead of working with the multi-population coalescent directly, we will 
consider a multi-population Moran model, in which the coalescent is embed¬ 
ded [32]. In particular, let £(u) denote the leaf populations descended from 
the population u, and let = J2iG£(v} number of present-day alle¬ 

les with ancestry in v. For each population v (except the root), we construct 
a Moran model going forward in time, i.e. starting at Ty and ending at 0. The 
Moran model consists of Uy lineages, each with either an ancestral or derived 
allele. Going forward in time, every lineage copies itself onto every other lin¬ 
eage at rate ^ay{t). Thus, the total rate of copying events is (^”)a„(t). Let 
pf denote the number of derived alleles at time t in population v. Then the 
transition rate of fjf when fjf = x is Ax->-a:-i-i(i) = = ^^"’2 ^^ ay{t), 

since there are x{ny — x) pairs of lineages with different alleles. 

The coalescent is embedded within the Moran model, because if we trace 
the ancestry of genetic material backwards in time in the Moran model, 
we obtain a genealogy with the same distribution as under the coalescent 
(Theorem 1.30 of Durrett [12]). Thus, we can obtain the expected number 
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of mutations with derived counts x, by summing over which population v 
the mutation occurred in: 

(13) /(x) = ^^/"J/c)P(x I =/c). 

V k=l 

Let x^ = {xi : i G -C(n)} denote the subsample of derived allele counts in 
the populations descended from v. Similarly, let x^ = {xi : i ^ £(n)}. Then 
for fc > 1, 


(14) 


P(x \nQ = k) 


P(x^ I = k), if x^ = 0, 
0, if x^ / 0. 


So it suffices to compute P(x^ | /Tq = A:) for all v and k. If v is the ith 
leaf population, then P(x^ \ fiQ = k) = Ik=xi- On the other hand, if v is an 
interior vertex with children vi and ^2, then 
(15) 


X,, 


/^o 


= k) = 


I 

E 

fei=0 


V ki ) \k-ki) 

ft) 


X. 


v\ 




= A:i)P(x, 


V2 


=k-ki), 


where P(x^;. | can be computed from 

Uy 

(16) P(x^ I =k) = ^P(x^ I = j)IP(/x(j = j I = k). 

1=0 


To compute the transition probability IP(/io = J I = k), note that 
the transition rate matrix of /rj" can be written as where QO) = 

{Qif)o<i,j<n^ is a (n + 1) X (n + 1) matrix with 



< li{ny - i), if \j - 1| = 1, 
_0, else. 


so then the transition probability is given by the matrix exponential 
(17) P(^(j = j I = k) = (e^'”' fo^ 

Thus, the joint SFS /(x) can be computed using (13) and (14), with 
P(x^ \ fiQ = k) given by recursively computing (15), (16), and (17), in 
a depth-first search on the population tree (i.e. Felsenstein’s tree-peeling 
algorithm, or the sum-product algorithm for belief propagation). 
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We now consider the computational complexity associated with each ver¬ 
tex V. Equations (15) and (16) each have 0{n^) terms, and must be solved 
for 0{n^) values of k] so naively, each vertex costs O(n^) time. However, we 
can improve (15) to 0{nylog{ny)) and (16) to 0(n„), using essentially the 
same speedups as in Bryant et al. [7]. Letting \ = k), 

(15) can be written as a convolution 


(18) 






IV2 - 


which can be computed in 0(n„ log(nt,)) time via the fast Fourier trans¬ 
form [10], since where T is the discrete Fourier 

transform. Similarly, letting i^{k) = iti^)/Ck)^ turns into 

(19) 


and this costs 0{ny) by the sparsity of using results for computing the 
action of sparse matrix exponentials [1, 39]. Transforming between and 
takes 0{ny) time. 

The computational complexity associated with a single vertex v is thus 
0(n^ log(ut,)). Therefore, computing the joint SFS entry /(x) for L distinct 
values of x takes 0{'n?‘V+n \og{n)VL) time for a binary population tree with 
arbitrary population size functions and no migration. This is a substantial 
improvement over the 0{n^V + n^VL) complexity of Chen [8], and the 
0(n^ log(n)FL) complexity of Bryant et al. [7]. Similar to Chen [8], our 
approach has the benefit of easily generalizing to arbitrary population size 
histories, not just piecewise constant sizes. 


4. Results. We implemented our formulas and algorithm in Python, 
using the Python packages numpy and scipy. We also implemented the for¬ 
mulas from Chen [8, 9], and compared the performance of the two algorithms 
on simulated data. 

We simulated data for demographic trees with V G {5,10,15, 25, 50,100} 
populations at the present, and ^ G {1, 2, 5,10} individuals per population. 
For each value of n, D, we used the program scrm [40] to generate 20 random 
datasets, each with a demographic history that is a random binary tree. 

In Figure 4, we compare the running time of the original algorithm of 
Chen [8, 9] against our new algorithm that utilizes the formulas for /^{k) 
presented in Section 2 and our new Moran-based approach described in 
Section 3.2. We find our algorithm to be orders of magnitude faster; the 
difference is especially pronounced as the number T) of populations grows. 
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Average Evaluation Time 


[-•-I Chen Momi 



Fig 4: Average computation time per joint SFS entry. For each combination 
of the number T) of populations and the sample size re/P per population, 
we generated 20 random datasets, each under a demographic history that is 
a random binary tree. The expected joint SFS for the resulting segregating 
sites were then computed using our method (momi) and that of Chen [8]. 
Average runtime (in seconds) per joint SFS entry is plotted on the y-axis, 
with each panel corresponding to a different value of n/P. As the plots 
show, our algorithm is orders of magnitude faster than Chen’s. Due to its 
significantly increased runtime, we were able to run Chen’s method only up 
to V = 15. 


Note that, due to the increased running time, we were only able to run 
Chen’s algorithm to completion for D < 15. 

In Figure 5, we compare the accuracy of the two algorithms. The figure 
compares the SFS entries returned by the two methods across a subset of 
the simulations depicted in Figure 4. To adequately capture the large range 
of numerical values returned by the Chen method, we transformed each SFS 













17 


Numerical Stability 



- 20 ^-1-1-1-1---r 

0 12 3 4 

Momi 


Fig 5: Numerical stability of the two algorithms. The plot compares the 
numerical values returned by our method (momi) and Chen’s method, 
for the simulations described in Figure 4. The three panels on the y- 
axis correspond to V £ {5,10,15}. To adequately illustrate the observed 
range of numerical values, the SFS values were transformed via the map 
z I—>■ sign(z) logio(l + the dashed line represents the identity y = x. 
The two methods agree for T> < 5, but Chen’s method displays considerable 
numerical instability for 2? > 10. 


entry using the transformation z i—)• sign(z) log]^Q(l + |z|). The line y = x 
is also plotted; points falling on the line depict the SFS entries where both 
methods agreed. All negative return values represent numerical errors. The 
two methods agree for T> < 5, but Chen’s algorithm displays considerable 
numerical instability for 2? = 10 and higher. 

5. Proofs. In this section, we provide proofs of the mathematical re¬ 
sults presented in earlier sections. 

5.1. A recursion for efficiently computing ¥y(A^ = m). We describe how 
to compute ¥y(AS^ = m), for all values of m < < n, in O(n^) time. First, 
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note that 

= m) 

= Fy{A^ = m + 1, {u} G Ct) + IPiy(^r = {^} ^ Cr) 


(l) 

[m + l)(m) 

K^-1) 


= m + l)+ 1- 


1,1 

mpy\m 


(l) 


= m + 1) + 1 - 


m{m — 1) 


= m) 


= m). 


Rearranging, we get the recursion 

( 20 ) 

Fiy{A^ = m) = 


1 m(m—1) 
^ - u{u-l) 


p„_,(4 = m) - + = m + 1) 


with base cases 


,{A^^ = i^) = e-ii) fo 


So after solving a{t)dt, we can use the recursion and memoization to 
solve for all of the O(n^) terms F^{A^ = m) in 0{n^) time. In particular, in 
the case of constant population size, a{t) = a, the base case is given by 


F^{A^^ = u) = e-O^, 

and in the case of an exponentially growing population size, a(t) = 
the base case is given by 






5.2. Proof of Lemma 1. Let Tmrca denote the time to the most recent 
common ancestor of the sample. We first note that 

fn (n) = T- E„ [Tmrca A r], 

since the branch length subtending the whole sample is the time between r 
and Tmrca- 

Next, note that [Tmrca A r] is equal to the number of polymorphic 
mutations in [0, r) where the individual “1” is derived. This is because, as 
we trace the ancestry of “1” backwards in time, all mutations hitting the 
lineage below Tmrca aie polymorphic, while all mutations hitting above 
Tmrca are monomorphic. 
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The expected number of polymorphic mutations with “1” derived is also 
equal to ^ since if a mutation has k derived leaves, the chance 

that “1” is in the derived set is Thus, 

n—1 

IEn[TMRCA A r] = ^ -f^{k), 

k=l 

which completes the proof. 

5.3. Proof of Lemma 2. We first note that 


Fn{M^ = k}) 

= Fn+iiM-^ = {1,..., A:}) + Fn+i{M^ = {1,. .. ,k,n + 1}). 

By exchangeability, we have = K) = + o{d) for all K C 

{1 ,... ,n}, so 

T/;w = 7HL/„ViW+yTiY/„v,(t+i). 

\kJ \ k ) \k+l) 


Multiplying both sides by (^) gives 

m) = ”~*;|'T „AiW + ^/„Vi(t+i). 

5.4. Proof of Lemma 3. Let a*{t) denote the inverse population size 
history given by 


a{t) if t < 

oo if t > r. 


So the demographic history with population size agrees with the origi¬ 
nal history up to time r, at which point the population size drops to 0 , and 
all lineages instantly coalesce into a single lineage with probability 1 . 

Let Tm,* denote the amount of time there are m ancestral lineages for the 
coalescent with size history Similarly, let fn,*{k) denote the SFS under 
the size history Then from the result of Polanski and Kimmel [37], 


n 

fn,*{^) = ^ Wn,k,'m^m[TmA- 
m=2 
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Note that for m > 1, we almost surely have Tm,* = i.e. the intercoales¬ 
cence time equals its truncated version, since all lineages coalesce instantly 
at r with probability 1. Thus, Similarly, for k < n, 

= fn,*{k), he. the SFS equals the truncated SFS, because the prob¬ 
ability of a polymorphic mutation occurring in [r, oo) is 0. 

Finally, note that and fn,*{k) = fn{k), because a(t) 

and a*{t) are identical on [0 ,r). 


5.5. Proof of Proposition 1. We start by showing that = m) = 

lPn(AC = m) + O{0). Let T[{IC) = denote the amount of time 

where 1C has i unkilled lineages. Let p denote the probability density func¬ 
tion. For {tn ,. ■ ■, tm) with — have 


p{Tl{lC)=tnm--,Tf,{lC)=tm) 




fL 


2=m+l 


= e-((T)“+^)*™ 


i=m+l 


2 / 2 


= f[ (i)ae-0^^+0{e) 

i=m+l 

= p{T- = t^,.--,Tf, = tm) + o{e), 


and so 


lim ¥n{A^ = m) 
e-s-o 


lim/ p(r^(/C) = tn, • ■ • 

6^0 Jj2U=r 

f P{Tn =tn,---,Tm= tm)dt 


'T,ti= 

Fn{A'; = m). 


where we can exchange the limit and the integral by the Bounded Conver¬ 
gence Theorem, because p(r^ (/C) = tn,---, Tfai^) = tm) < OLm-Hi ((D“ + 2 ) 
for 6 < 1. 

Thus we have 


Fni\M^\=k,A^ = m) 


Fni\M^\ =k\A^ = m)Fn{A^ = m) 

(^^fnik I kl^ = m) + o(0)^ {Fn{A^r = m) + 0{e)) 

^fnik I = rn)Fn{A^r = m) + o{e), 
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which proves the first part of the proposition. 

We next solve for f^{k \ = m), the first order Taylor series coefficient 

for = k I = m) in the mutation rate 

When there are i unkilled lineages, the probability that the next event is 
a killing event is + o{9). Given that the event is a killing, 

the chance that the killed lineage has k leaf descendants is p^\. So summing 
over i, and dividing out the mutation rate we get 

rn{k \A'^ = m) 


where we made the change of variables j = n — k — i + 1, and where the 
final line follows from repeated application of the combinatorial identity 

0 = rA + 

5.5.1. Alternative proof for ff{k \ A^ = m) via the Chinese Restaurant 
Process. We sketch an alternative proof of the expression for ff{k \ A^ = 
m), using the Chinese Restaurant Process. 

Consider the coalescent with killing going forward in time (towards the 
present), and only looking at it when the number of individuals increases. 
Then when there are i lineages, a new mutation occurs with probability 

lineage branches with probability 

Thus, conditional on A^ = m, the distribution on ICr is given by a Chinese 




22 


KAMM, TERHORST AND SONG 


Restaurant Process [2] , starting with m tables each with 1 person, and with 
new tables founded with parameter 9/a. 

Let (x)j'|- = x{x + 1) • • • (x + i — 1) denote the rising factorial. If there is 
a single mutation with k descendants, then there are ways to pick 

which of the n — m events involve mutant lineages. The probability of a 
particular such ordering is 

9 _ 9 {k - l)\{n - k - l)\/m\ ^ 

a{m + 9/a)n-m\ « (n —l)!/m! 

Summing over all orderings, and dividing by yields 

fT/, I ^ 2 /n - m\ (fc- l)!(n-fc - l)!/m! 

^ ^ a\ k ) (n-l)!/m! 
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